Abstract

This paper studies properties of the score distributions
of calibrated log-likelihood-ratios that are used in automatic
speaker recognition. We derive the essential condition for
calibration: that the log-likelihood-ratio of the
log-likelihood-ratio is the log-likelihood-ratio. We then
investigate the consequences of this condition for the
probability density functions (PDFs) of the
log-likelihood-ratio score. We show that if the PDF of the
non-target distribution is Gaussian, then the PDF of the
target distribution must be Gaussian as well. The means
and variances of these two PDFs are interrelated, and
determined completely by the discrimination performance
of the recognizer as characterized by the equal error rate.
These relations allow for a new way of computing the
offset and scaling parameters for linear calibration. We
derive closed-form expressions for these parameters and show
that for modern i-vector systems with PLDA scoring this
leads to good calibration, comparable to traditional logistic
regression, over a wide range of system performance.

1. Introduction 

In recent years, calibration in automatic speaker recognition
has received more attention [1-11]. Intuitively, calibration is
related to the ability to properly set a threshold in a speaker
detection system so as to minimize the expected error [12].
In speaker detection, the task is to decide whether or not two
speech signals originate from the same speaker. Because all
speaker recognition systems internally work with some scalar
score that expresses speaker similarity, a score threshold can
control the trade-off between the two types of errors that a
system can make [1, 13, 14]. Indeed, in the series of NIST Speaker
Recognition Evaluations (SRE) the primary evaluation measure
has been sensitive to calibration. Until SRE 2010, calibration
was assessed at a single operating point, through a single
decision cost function known as $C_{\mathrm{det}}$. Other
technologies in speech processing and biometrics also use
calibration-sensitive evaluation measures, such as the cost
function $C_{\mathrm{avg}}$ in language recognition [15] and the
Half Total Error Rate in face recognition [16].

Since around 2004 [1, 2] the concept of calibration in speaker
recognition has been generalized to a range of operating points
by using proper scoring rules [17] to evaluate probabilistic
statements about whether a trial is a same-speaker (target) or
different-speaker (non-target) trial. A system that represents its
score as a likelihood-ratio can be well-calibrated over a wide
range of operating points simultaneously. This representation of
the speaker recognition score has direct application in speaker
detection, as the decision threshold follows directly from the
cost function parameters [14], but also in evidence reporting in
forensic speaker comparison cases [4, 18]. In the NIST SRE
2012, for the first time, hard decisions were no longer required,
and instead the recognition score had to be submitted in the
form of a likelihood-ratio. The evaluation measure effectively
sampled the decision cost function at two different parameter
settings [19, 20].

Since a calibrated likelihood-ratio is still just a score, all
properties of normal scores apply to likelihood-ratios as well,
and we can draw DET and ROC plots, determine EERs and
inspect the score distributions. The axis warping of the DET
plot [13] in combination with the observed more-or-less straight
DET curves suggests that target and non-target score
distributions could be accurately modelled with Gaussians. These
score distributions and their relation to the DET have been
studied previously [21, 22] and are very instructive for the
understanding of basic detection theory and the concepts of
calibration [14, 23]. In this paper we are interested in properties
of the distributions of calibrated log-likelihood-ratios. This may
help in situations where we carry out a calibration transformation
on raw recognition scores, because it can tell us what the
calibrated distributions should look like.

The paper is organized as follows. We define the very nature
of a calibrated likelihood-ratio in Section 2. In Section 3
we investigate the properties of log-likelihood-ratio
distributions when they are Gaussian, and we then apply these in
Section 4 as a new method for calibration. We then present
experiments and conclusions.

2. Likelihood-ratio idempotence

Here we carefully define the likelihood-ratio (LR) and show that 
it has the interesting property: the LR of the LR is the LR, which 
forms a definition of calibration. 

The speaker recognition system has as input two speech
segments, denoted X and Y, which it processes in two steps.
We represent the first step as $s = f(X, Y)$. To keep things
general, s may represent different kinds of output, e.g., a pair of
acoustic feature vector sequences, a pair of i-vectors, or just a
single, scalar recognition score. The second step is to compute
the likelihood-ratio r as a function of s, as:

\[
r = \frac{P(s \mid H_1, \mathcal{M})}{P(s \mid H_2, \mathcal{M})}
\tag{1}
\]

where $H_1$ is the (target) hypothesis that X and Y originate from
the same speaker, $H_2$ the (non-target) hypothesis that they are
from two different speakers, and $\mathcal{M}$ is a generative
probabilistic model for s. In current practice, s is always the
recognition score, so that $\mathcal{M}$ merely models scalar
scores, not i-vectors, acoustic feature sequences or speech
signals. But our theory below is sufficiently general to remain
applicable in future to more ambitious models, when s might have a
more complex form. We now assume that the hypothesis prior,
$\pi = P(H_1)$, is given, which allows us to express the hypothesis
posterior via Bayes' rule as:

\[
P(H_1 \mid s, \mathcal{M}, \pi) = \frac{\pi r}{\pi r + (1 - \pi)}
\tag{2}
\]

This shows that r is a sufficient statistic: the posterior depends
on s only through r. This allows rewriting the posterior as:

\[
P(h \mid s, \mathcal{M}, \pi) = P(h \mid r, \mathcal{M}', \pi),
\qquad h \in \{H_1, H_2\}
\tag{3}
\]

where we have introduced $\mathcal{M}'$ to denote $\mathcal{M}$,
augmented by asserting (1). Although r contains all the relevant
information that $\mathcal{M}$ can extract from s to recognize the
unknown hypothesis, it must be stressed that r and s do not
necessarily contain all the relevant information that could have
been extracted from the original input X, Y by some more elaborate
model. Now we use the odds form of Bayes' rule:



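\[
\frac{P(H_1 \mid \rho, \mathsf{M}, \pi)}{P(H_2 \mid \rho, \mathsf{M}, \pi)}
= \frac{\pi}{1 - \pi}\,
\frac{P(\rho \mid H_1, \mathsf{M})}{P(\rho \mid H_2, \mathsf{M})}
\tag{4}
\]

where $\rho$ is a placeholder for r or s and $\mathsf{M}$ for
$\mathcal{M}$ or $\mathcal{M}'$. Combining this with (3), we find
the desired relationship (the LR of the LR is the LR [24]):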



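\[
\frac{P(s \mid H_1, \mathcal{M})}{P(s \mid H_2, \mathcal{M})}
= \frac{P(r \mid H_1, \mathcal{M}')}{P(r \mid H_2, \mathcal{M}')}.
\tag{5}
\]

If we define x to be the log-likelihood-ratio (LLR):

\[
x = \log r
\tag{6}
\]

we also find¹ (the LLR of the LLR is the LLR):

\[
x = \log \frac{P(x \mid H_1, \mathcal{M}'')}{P(x \mid H_2, \mathcal{M}'')}
\tag{7}
\]

where $\mathcal{M}''$ augments $\mathcal{M}'$ by addition of (6).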
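The step from a likelihood-ratio and a prior to a posterior, as in (2), can be sketched in a few lines of Python. This is only an illustration: the function names `posterior_from_lr` and `posterior_from_llr` are hypothetical and do not come from the paper or from any toolkit.

```python
import math

def posterior_from_lr(r, prior):
    """Posterior P(H1 | s) from a likelihood-ratio r and prior P(H1), Eq. (2)."""
    return prior * r / (prior * r + (1.0 - prior))

def posterior_from_llr(x, prior):
    """Same posterior, computed from the log-likelihood-ratio x = log r.

    The posterior log-odds are the LLR plus the prior log-odds
    (the odds form of Bayes' rule); a logistic function maps them back
    to a probability.
    """
    log_odds = x + math.log(prior) - math.log(1.0 - prior)
    if log_odds >= 0.0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)  # avoids overflow for very negative log-odds
    return e / (1.0 + e)

# A likelihood-ratio of 4 exactly offsets prior odds of 1:4.
print(posterior_from_lr(4.0, 0.2))  # 0.5
```

Both functions depend on the input only through r (or x), which is the sufficiency property used in the text.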
2.1. Implications 

Rewriting (5) as:

\[
P(r \mid H_1, \mathcal{M}') = r \, P(r \mid H_2, \mathcal{M}')
\tag{8}
\]

we see that if either of the two distributions is given, then the
other distribution is completely determined: they cannot vary
independently. Moreover, a further restriction is placed on these
distributions: since the LHS must integrate to 1, the expected
value of r under the non-target distribution (the integral of the
RHS) must be $\langle r \rangle = 1$. Similarly, for targets:
$\langle 1/r \rangle = 1$. By applying Jensen's inequality [25] we
also find for targets: $\langle x \rangle > 0$ and for non-targets:
$\langle x \rangle < 0$.
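These moment conditions can be checked numerically. The sketch below assumes, purely for illustration, the Gaussian pair derived in Section 3 (means at plus and minus mu, common variance 2 mu) as one example of distributions that satisfy the constraint. The helper `expectation` and the choice mu = 2 are hypothetical.

```python
import math
from statistics import NormalDist

def expectation(func, dist, lo=-30.0, hi=30.0, steps=20000):
    """Midpoint-rule approximation of E[func(x)] for x distributed as dist."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += func(x) * dist.pdf(x)
    return total * dx

mu = 2.0                          # assumed example value
sigma = math.sqrt(2.0 * mu)       # variance tied to the mean
targets = NormalDist(mu, sigma)   # distribution of the LLR x for targets
non_targets = NormalDist(-mu, sigma)

# r = exp(x), so <r> for non-targets and <1/r> for targets are:
mean_r_non = expectation(math.exp, non_targets)
mean_inv_r_tar = expectation(lambda x: math.exp(-x), targets)

# Mean LLRs, which Jensen's inequality says have opposite signs.
mean_x_tar = expectation(lambda x: x, targets)
mean_x_non = expectation(lambda x: x, non_targets)

print(mean_r_non, mean_inv_r_tar, mean_x_tar, mean_x_non)
```

With these assumptions the first two values come out at 1 up to integration error, and the mean LLRs are positive for targets and negative for non-targets.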

2.2. Good and bad calibration 

How does (5) function as a definition of calibration? Since it is
an equality, won't all LRs calculated via (1) by some model
$\mathcal{M}$ just automatically satisfy (5)? Yes they will, but
only if $\mathcal{M}$ and $\mathcal{M}'$ are related as explained
above. If we want to independently judge the goodness of the
calibration of r, we do not condition the distributions for r on
the recognizer's model $\mathcal{M}$. Instead, we could empirically
observe the target and non-target values of r as calculated by the
recognizer over an independent, supervised database of speaker
detection trials. Letting $\mathcal{O}$ denote the empirical
observation, we could then say the model $\mathcal{M}$ is well
calibrated if:

\[
\frac{P(s \mid H_1, \mathcal{M})}{P(s \mid H_2, \mathcal{M})}
\approx \frac{P(r \mid H_1, \mathcal{O})}{P(r \mid H_2, \mathcal{O})}
\tag{9}
\]

Bad calibration is when the LRs given respectively by the
recognizer's $\mathcal{M}$ and empirical observation $\mathcal{O}$
do not agree in this way. This can and does happen, since
$\mathcal{O}$ is independent of any development data that was used
to determine the form and parameters of $\mathcal{M}$.

It should be noted that (9) does not give a practical recipe
to judge the degree of goodness of calibration: it specifies neither
how to assign $P(r \mid h, \mathcal{O})$, nor how to numerically
evaluate the agreement between LHS and RHS. For practical solutions
for calibration-sensitive objective functions, see for example [26].

¹ To see this, note that the log transformation is monotonic and the
Jacobian of the transformation cancels in the ratio.

3. Gaussian distributed log-likelihood-ratios

Inspired by the fact that DET curves in speaker recognition
tend to be straight [21], we explore a Gaussian solution to the
LLR distribution constraint (7). Since target and non-target
LLR distributions are so tightly coupled, it turns out that if
the one is assumed to be Gaussian, then the other must also
be. We shall use the shorthand $e(x) = P(x \mid H_1, \mathcal{M}'')$
and $d(x) = P(x \mid H_2, \mathcal{M}'')$. Arbitrarily assuming a
Gaussian distribution for non-targets (different-speaker trials):

\[
d(x) = \mathcal{N}(x \mid \mu_d, \sigma_d)
= \frac{1}{\sqrt{2\pi}\,\sigma_d}\, e^{-(x-\mu_d)^2/2\sigma_d^2}
\tag{10}
\]

We derive the functional form for targets², e(x), when (7) applies:

\[
e(x) = e^{x} d(x)
= \frac{1}{\sqrt{2\pi}\,\sigma_d}\, e^{x-(x-\mu_d)^2/2\sigma_d^2}
\tag{11}
\]



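We collect the terms in x in the exponent, which itself can be
written as

\begin{align}
x - \frac{(x-\mu_d)^2}{2\sigma_d^2}
&= -\frac{x^2 - 2\mu_d x + \mu_d^2}{2\sigma_d^2}
   + \frac{2\sigma_d^2 x}{2\sigma_d^2} \tag{12} \\
&= -\frac{x^2 - 2(\mu_d + \sigma_d^2)x + \mu_d^2}{2\sigma_d^2} \tag{13} \\
&= -\frac{(x - [\mu_d + \sigma_d^2])^2}{2\sigma_d^2}
   + \frac{2\mu_d\sigma_d^2 + \sigma_d^4}{2\sigma_d^2}. \tag{14}
\end{align}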



The first term is in the familiar form of a Gaussian exponent,
the second will result in a constant factor. Gathering terms, and
writing

\[
\mu_e = \mu_d + \sigma_d^2,
\tag{15}
\]

the expression for the same-speaker comparison
log-likelihood-ratio scores becomes

\begin{align}
e(x) &= \frac{1}{\sqrt{2\pi}\,\sigma_d}\,
        e^{\sigma_d^2/2 + \mu_d}\, e^{-(x-\mu_e)^2/2\sigma_d^2} \tag{16} \\
     &= e^{\sigma_d^2/2 + \mu_d}\, \mathcal{N}(x \mid \mu_e, \sigma_d). \tag{17}
\end{align}

We see that e(x) is of Gaussian shape, with

\[
\sigma_e = \sigma_d = \sigma.
\tag{18}
\]

² Trials where the speakers are equal.



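Since e(x) must be a proper PDF, its integral over x must be
unity, from which follows that

\begin{align}
e^{\sigma^2/2 + \mu_d} \int_{-\infty}^{\infty}
  \mathcal{N}(x \mid \mu_e, \sigma)\, dx &= 1, \tag{19} \\
\sigma^2/2 + \mu_d &= 0. \tag{20}
\end{align}

Finally, with (15) we find

\[
\mu_e = \mu_d + \sigma^2 = -\mu_d = \mu.
\tag{21}
\]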



This shows that d(x) and e(x) are equal-variance Gaussians
with means symmetric around zero at $\pm\mu$, and where the
variance and mean are related through (20):

\[
\sigma^2 = 2\mu.
\tag{22}
\]
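The result can be verified numerically: for Gaussians with means at plus and minus mu and variance 2 mu, the log of the density ratio at x should return x itself. The following is a minimal sketch using only the Python standard library. The function names and the value mu = 1.5 are illustrative assumptions.

```python
import math
from statistics import NormalDist

def gaussian_llr_pair(mu):
    """Target and non-target LLR densities: means +mu and -mu, variance 2*mu."""
    sigma = math.sqrt(2.0 * mu)
    return NormalDist(mu, sigma), NormalDist(-mu, sigma)

def llr_of_llr(x, mu):
    """log e(x) - log d(x) for the constrained Gaussian pair."""
    e, d = gaussian_llr_pair(mu)
    return math.log(e.pdf(x)) - math.log(d.pdf(x))

def llr_unconstrained(x, mu, sigma):
    """Same log-ratio, but with a variance that ignores the constraint."""
    e, d = NormalDist(mu, sigma), NormalDist(-mu, sigma)
    return math.log(e.pdf(x)) - math.log(d.pdf(x))

for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    # First value reproduces x; the second (sigma = 1) is 3*x instead.
    print(x, llr_of_llr(x, mu=1.5), llr_unconstrained(x, mu=1.5, sigma=1.0))
```

The second column of output shows that symmetric means alone are not enough: the ratio is only idempotent when the variance equals twice the mean.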



3.1. Equal Error Rate and d' 

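Using the symmetry of the solution, it is clear that the threshold
for the equal error rate is at x = 0. Using the expression for the
miss probability, the equal error rate $E_{=}$ is

\begin{align}
E_{=} &= \int_{-\infty}^{0} \mathcal{N}(x \mid \mu, \sigma)\, dx \tag{23} \\
      &= \int_{-\infty}^{-\mu/\sigma} \mathcal{N}(x \mid 0, 1)\, dx
       = \Phi(-\mu/\sigma), \tag{24}
\end{align}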



where $\Phi(x)$ is the cumulative normal distribution.

It is sometimes useful to recognize the parameter d' from
detection theory, which is the difference in means expressed in
terms of the standard deviation, here $d' = 2\mu/\sigma$. With (24)
the relation becomes

\begin{align}
E_{=} &= \Phi(-\tfrac{1}{2} d'), \tag{25} \\
d' &= \sigma = -2\Phi^{-1}(E_{=}), \tag{26}
\end{align}

introducing $\Phi^{-1}(y)$, the inverse of the cumulative normal
distribution. The importance of the relations above is that $\mu$
and $\sigma$ are determined by the discrimination performance
measured by $E_{=}$, using (22) and (26):

\[
\mu = \tfrac{1}{2}\sigma^2 = 2\left[\Phi^{-1}(E_{=})\right]^2.
\tag{27}
\]
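As a small worked example of these relations, the sketch below maps an equal error rate to the parameters of the calibrated Gaussian LLR distributions, using (26) for sigma and (22) for mu. It relies on `statistics.NormalDist` from the Python standard library. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit variance

def llr_params_from_eer(eer):
    """Return (mu, sigma) of calibrated Gaussian LLRs for a given EER."""
    if not 0.0 < eer < 0.5:
        raise ValueError("EER must lie strictly between 0 and 0.5")
    sigma = -2.0 * STD_NORMAL.inv_cdf(eer)  # Eq. (26): d' = sigma
    mu = 0.5 * sigma ** 2                   # Eq. (22): sigma^2 = 2 mu
    return mu, sigma

def eer_from_sigma(sigma):
    """Inverse relation, Eq. (25) with d' = sigma."""
    return STD_NORMAL.cdf(-0.5 * sigma)

mu, sigma = llr_params_from_eer(0.05)
print(mu, sigma)              # roughly 5.41 and 3.29
print(eer_from_sigma(sigma))  # recovers 0.05 up to rounding
```

For an EER of 5 % the target and non-target LLRs would thus be centred near +5.4 and -5.4 with a standard deviation of about 3.3, under the Gaussian assumption.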



4. A new calibration method

In practice, automatic speaker recognition systems do not
deliver scores that can directly be interpreted as a
log-likelihood-ratio, even though they are computed as such, for
instance in the good old UBM-GMM scoring [27] or the latest i-vector
PLDA scoring [28]. A practical solution to this is to convert
raw scores s(X, Y) to calibrated log-likelihood-ratios by some
transformation function x(s), usually constrained to be
monotonically increasing. There are many ways of doing this. The
FoCal [29] and BOSARIS [30] toolkits use logistic regression to
discriminatively train linear calibration transformations. Other
possibilities include isotonic regression (PAV [30]) and line-up
calibration [9] that uses the rank in a line-up of foil speakers. In
FoCal or BOSARIS, the score-to-LLR function is affine:

\[
x(s) = a s + b
\tag{28}
\]



and the parameters a and b are found by optimizing cross- 
entropy, a calibration-sensitive objective function defined on a 
supervised set of speaker recognition trials. 



Here we contrast the popular discriminative logistic
regression solution with a new generative, constrained
maximum-likelihood (ML) solution. Our constraints follow from
assuming (i) Gaussian LLR distributions, and (ii) an affine
score-to-LLR transform (28). This implies that (i) the LLR
distributions are constrained as derived in Section 3 and (ii) the
score distributions are also Gaussians, with equal variances. With
no LLR distribution constraints, we would have had 6 free
parameters: 2 means, 2 variances and 2 calibration parameters. But
we have imposed 3 constraints: equal variances (18), symmetric
means (21), and (22). We find the remaining 3 free parameters
by maximizing the following weighted likelihood:

\[
\frac{\alpha}{N_e} \sum_{i \in \mathcal{E}} \log \mathcal{N}(s_i \mid m_e, v)
+ \frac{1-\alpha}{N_d} \sum_{i \in \mathcal{D}} \log \mathcal{N}(s_i \mid m_d, v)
\]

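where $\mathcal{E}$ and $\mathcal{D}$ index the $N_e$ target and
$N_d$ non-target scores, weighted by $\alpha$ and $1 - \alpha$,
respectively. The score distribution parameters that need to be
optimized are the means $m_e$, $m_d$ and the common variance $v$.
Setting derivatives to 0, we find the maximum likelihood at the
sample means:

\[
m_e = \frac{1}{N_e} \sum_{i \in \mathcal{E}} s_i, \qquad
m_d = \frac{1}{N_d} \sum_{i \in \mathcal{D}} s_i
\tag{29}
\]

and at a weighted combination of sample variances:

\[
v = \frac{\alpha}{N_e} \sum_{i \in \mathcal{E}} (s_i - m_e)^2
  + \frac{1-\alpha}{N_d} \sum_{i \in \mathcal{D}} (s_i - m_d)^2
\tag{30}
\]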



By (28), the LLR distribution parameters become
$\sigma^2 = a^2 v$, $\mu_e = a m_e + b$ and $\mu_d = a m_d + b$.
Finally, applying the constraints $\sigma^2 = \mu_e - \mu_d$ and
$\mu_e = -\mu_d$, we can solve for the calibration parameters:

\[
a = \frac{m_e - m_d}{v}, \qquad
b = -a \, \frac{m_e + m_d}{2}
\tag{31}
\]



We call this recipe constrained, maximum-likelihood, Gaussian 
(CMLG) calibration. An advantage of CMLG is that it has a 
closed form, in contrast to the iterative optimization required 
by logistic regression. 
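The closed-form recipe of (29), (30) and (31) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cmlg_calibration`, its argument names and the toy scores are assumptions.

```python
def cmlg_calibration(target_scores, non_target_scores, alpha=0.5):
    """Return (a, b) of the affine map x(s) = a*s + b, following Eqs. (29)-(31).

    alpha weights the target trials and (1 - alpha) the non-target trials
    in the pooled variance estimate.
    """
    n_e, n_d = len(target_scores), len(non_target_scores)
    if n_e == 0 or n_d == 0:
        raise ValueError("need at least one target and one non-target score")

    # Eq. (29): sample means of the raw scores.
    m_e = sum(target_scores) / n_e
    m_d = sum(non_target_scores) / n_d

    # Eq. (30): weighted combination of the two sample variances.
    v = (alpha / n_e * sum((s - m_e) ** 2 for s in target_scores)
         + (1.0 - alpha) / n_d * sum((s - m_d) ** 2 for s in non_target_scores))
    if v == 0.0:
        raise ValueError("scores have zero variance")

    # Eq. (31): scale and offset.
    a = (m_e - m_d) / v
    b = -a * (m_e + m_d) / 2.0
    return a, b

# Toy scores, chosen only so the arithmetic is easy to follow.
a, b = cmlg_calibration([3.0, 5.0], [-1.0, 1.0])
print(a, b)  # 4.0 -8.0
```

In the toy example the calibrated means become +8 and -8 and the calibrated variance is a² v = 16, so the constraint that the variance equals twice the mean holds after the transformation.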

4.1. Experiment 

In order to test CMLG we apply it to a number of recognition
trial sets. We use a set of trials crafted for duration-dependence
experiments [8] from the NIST SRE 2008 and 2010 trial sets,
the telephone-telephone "extended" trial lists. We constructed
short duration segments of 5, 10, 20, and 40 seconds from both
train and test segments by simply selecting the first frames
after speech activity detection. All durations, including the full
conversation side, were tested in all combinations, leading to
25 different trial lists. The NIST SRE10 'det-5' performance
over these lists ranges from $E_{=}$ = 2.9 % to 26 %. The
recognition system is a standard i-vector based system with PLDA
scoring described elsewhere [20].

We contrast CMLG (with $\alpha = \tfrac{1}{2}$) with the traditional
logistic regression method. The calibrations are trained on NIST SRE
2008 data (427 375 trials) and applied to SRE 2010 trials for
evaluation (10 007 900 trials), all gender mixed. We evaluate
the 25 different trial list combinations using $C_{\mathrm{llr}}$,
a cost function that is sensitive to calibration over the whole DET
curve [2]. We used R's glm routine for logistic regression.
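For readers unfamiliar with the measure, the sketch below computes the log-likelihood-ratio cost from two lists of calibrated LLRs, using the commonly used definition as the average logarithmic cost over target and non-target trials, expressed in bits. The paper itself only cites [2] for the definition, so the exact form here is an assumption, as are the function names.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def cllr(target_llrs, non_target_llrs):
    """Log-likelihood-ratio cost in bits for natural-log LLRs.

    Targets are penalized for low LLRs, non-targets for high LLRs;
    the two class averages are weighted equally.
    """
    c_tar = sum(softplus(-x) for x in target_llrs) / len(target_llrs)
    c_non = sum(softplus(x) for x in non_target_llrs) / len(non_target_llrs)
    return 0.5 * (c_tar + c_non) / math.log(2.0)

# An uninformative system that always outputs LLR = 0 costs exactly 1 bit.
print(cllr([0.0, 0.0], [0.0]))  # 1.0
```

A well-calibrated, discriminative system yields a value well below 1, while confidently wrong LLRs can push the cost above 1.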

The results are shown in Fig. 1, where we have plotted the
$C_{\mathrm{llr}}$ obtained using CMLG calibration versus the
$C_{\mathrm{llr}}$ obtained using logistic regression. The values
are highly correlated. For CMLG, the average $C_{\mathrm{llr}}$ over
all 25 conditions is 0.375, for logistic regression it is 0.376.
These can be called good, as the mean $C_{\mathrm{llr}}^{\min}$ is
0.370.

Figure 1: $C_{\mathrm{llr}}$ values of the 25 trial lists for the
constrained maximum-likelihood Gaussian (CMLG) method (horizontal)
versus logistic regression (vertical).

We have also used the NIST SRE12 scores from the ABC team
to study the effect of $\alpha$ in (30) on another
calibration-sensitive measure, $C_{\mathrm{primary}}$, cf. Fig. 2;
for details we refer to [26]. The figure shows that with CMLG good
calibration results can be obtained for a different system with
different data and a different performance measure, if the correct
$\alpha$ is chosen.

5. Discussion and Conclusions 

We have shown in this paper that if the different-speaker
calibrated log-likelihood-ratio scores from a speaker recognition
system follow a Gaussian distribution, then the distribution of
the same-speaker scores must also be Gaussian after calibration,
with the same variance but opposite mean. Because monotonically
increasing score-to-likelihood-ratio functions do not
change the DET plot, such equal-variance distributions in the
calibrated score domain imply 45° DET-plots in the raw score
domain as well, which is neither observed with real data³ nor
desired for applications operating in the low false alarm region.
The logical conclusion then is that real scores, if they are
well-calibrated, will not be Gaussian. However, we see that our
PLDA system can be calibrated quite well under the Gaussian
assumptions, and indeed we have noticed that i-vector PLDA
systems tend to have score distributions that appear more
Gaussian than earlier technologies, such as i-vector LDA cosine
distance scoring, support vector machines or the UBM-GMM
likelihood ratio scoring.

The Gaussian solution to the LLR equation (7) is one where
both distributions are shaped by the same mathematical function.
In signal detection theory, where the distribution represents
noise, this seems almost mandatory, but in speaker recognition
this is not an obvious assumption. We have experimented
with other distributions, e.g., in the likelihood-ratio domain (5)
a pair of Gamma distributions is a solution to the calibration
condition, and these are asymmetric in the log-likelihood-ratio
domain. However, such distributions seem to be not at all
representative of real score distributions. Also, an arbitrary linear
combination of Gaussians with different means and corresponding
variances is a solution to (7), which allows some freedom in
fitting the shape of a score distribution. In principle, there is no
need for real score distributions to follow any mathematical
description, but we have observed that many researchers like to
use some form of idealized shape of the score distributions to
understand the data [4, 21]. When calibration methods are
designed, condition (7) should therefore be taken into account.

Figure 2: $C_{\mathrm{primary}}$ for logistic regression and CMLG
calibration methods for ABC's SRE12 submission, as a function of the
prior $\alpha$ used in the objective / ML optimization (horizontal
axis: $\log(\alpha) - \log(1-\alpha)$).

³ We have measured the slope of the DET in the conventional error
region 0.1-50 % for the data in the experiment. The mean slope over
the 25 conditions is -0.99 with a standard deviation of 0.06, so in
fact this data appears to honour the equal variance condition quite
well.

The relations derived in Section 3 open up more possibilities
for relations between the various evaluation measures. For
instance, we can compute $C_{\mathrm{llr}}$ by numerical
integration as

\[
C_{\mathrm{llr}} = \frac{1}{\log 2} \int_{-\infty}^{\infty}
  \mathcal{N}(x \mid \mu, \sigma) \log(1 + e^{-x})\, dx
\tag{32}
\]

and this relates $C_{\mathrm{llr}}$ to $E_{=}$ via (26) and (27) for
Gaussian score distributions. E.g., for our set of 25 trial lists
this expression differs from $C_{\mathrm{llr}}^{\min}$ by only 0.006
in root mean squared difference, or about 2 %. Besides calibration,
the relations can also be used for fusion of systems. For
pre-calibrated systems this leads to solutions that transparently
depend on the correlation between the scores.
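The integral in (32) is straightforward to approximate. The sketch below uses a simple midpoint rule over a range of ten standard deviations around the target mean, with mu and sigma obtained from the equal error rate as in Section 3.1. The function names, the integration range and the step count are illustrative assumptions, not the procedure used for the reported numbers.

```python
import math
from statistics import NormalDist

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def cllr_from_eer(eer, steps=20000):
    """Approximate Eq. (32) for calibrated Gaussian LLRs with the given EER."""
    if not 0.0 < eer < 0.5:
        raise ValueError("EER must lie strictly between 0 and 0.5")
    sigma = -2.0 * NormalDist().inv_cdf(eer)  # Eq. (26)
    mu = 0.5 * sigma ** 2                     # Eq. (22)
    targets = NormalDist(mu, sigma)

    lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx               # midpoint of the i-th interval
        total += targets.pdf(x) * softplus(-x)
    return total * dx / math.log(2.0)

for eer in (0.01, 0.05, 0.10, 0.25):
    print(eer, cllr_from_eer(eer))
```

The printed values increase with the EER and stay below 1 bit, as expected for well-calibrated scores.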

The fact that we can obtain the linear calibration parame- 
ters under the Gaussian assumption is an interesting side-effect 
of this study. The calibration parameters can be expressed in 
closed-form, and do not explicitly consider cross entropy or $C_{\mathrm{llr}}$
as an optimization objective. For score distributions that do not 
resemble a Gaussian, this calibration method is likely to fail — 
we therefore do not recommend CMLG calibration as a general 
technique. Still, we are quite pleased that the experiments sup- 
port the mostly theoretical results of this paper. 